None
Project Goal: Home Credit will be able to identify if a customer is a safe candidate to lend to, then create a personalized customer loan and repayment plan to be accountable for, resulting in an increase in revenue, improved customer experience, and lower default rates.
Business Problem: Home Credit desires to know safe borrowers in a customer base that is unfamiliar with banking and give the customer a plan for successful loan repayment. Lending to those who are more likely to default on loans decreases the profits of Home Credit and results in negative customer experiences.
Analytic Problem:
The target variable is specificially customers that do have a negative history of repayment to lend to, and postive repayment. Represented in the application_train/test.csv sets of binary where 1 = Not trust worthy borrower (Client with payment difficulties), 0 = Trustworthy borrower (client with good repayment history).
Predict which customers will be good borrowers, using a classification method based on customer financial behavior data.
Use a regression method to see the relationship that a trust worthy customer has to other attributes about them.
* Main dataset. This file contains the information of the target variable as binary for trustworthy to loan to.
| Column | Description |
|---|---|
| SK_ID_CURR | ID of loan in our sample |
| TARGET | Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases) |
| NAME_CONTRACT_TYPE | Identification if loan is cash or revolving |
| CODE_GENDER | Gender of the client |
| FLAG_OWN_CAR | Flag if the client owns a car |
| FLAG_OWN_REALTY | Flag if the client owns a house or flat |
| CNT_CHILDREN | Number of children the client has |
| AMT_INCOME_TOTAL | Income of the client |
| AMT_CREDIT | Credit amount of the loan |
| AMT_ANNUITY | Loan annuity |
| AMT_GOODS_PRICE | For consumer loans it is the price of the goods for which the loan is given |
| NAME_TYPE_SUITE | Who was accompanying client when he was applying for the loan |
| NAME_INCOME_TYPE | Clients income type (businessman, working, maternity leave,...) |
| NAME_EDUCATION_TYPE | Level of highest education the client achieved |
| NAME_FAMILY_STATUS | Family status of the client |
| NAME_HOUSING_TYPE | What is the housing situation of the client (renting, living with parents, ...) |
| REGION_POPULATION_RELATIVE | Normalized population of region where client lives (higher number means the client lives in... |
| DAYS_BIRTH | Client's age in days at the time of application |
| DAYS_EMPLOYED | How many days before the application the person started current employment |
| DAYS_REGISTRATION | How many days before the application did client change his registration |
| DAYS_ID_PUBLISH | How many days before the application did client change the identity document with which he... |
| OWN_CAR_AGE | Age of client's car |
| FLAG_MOBIL | Did client provide mobile phone (1=YES, 0=NO) |
| FLAG_EMP_PHONE | Did client provide work phone (1=YES, 0=NO) |
| FLAG_WORK_PHONE | Did client provide home phone (1=YES, 0=NO) |
| FLAG_CONT_MOBILE | Was mobile phone reachable (1=YES, 0=NO) |
| FLAG_PHONE | Did client provide home phone (1=YES, 0=NO) |
| FLAG_EMAIL | Did client provide email (1=YES, 0=NO) |
| OCCUPATION_TYPE | What kind of occupation does the client have |
| CNT_FAM_MEMBERS | How many family members does client have |
| REGION_RATING_CLIENT | Our rating of the region where client lives (1,2,3) |
| REGION_RATING_CLIENT_W_CITY | Our rating of the region where client lives with taking city into account (1,2,3) |
| WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply for the loan |
| HOUR_APPR_PROCESS_START | Approximately at what hour did the client apply for the loan |
| REG_REGION_NOT_LIVE_REGION | Flag if client's permanent address does not match contact address (1=different, 0=same, at region level |
| REG_REGION_NOT_WORK_REGION | Flag if client's permanent address does not match work address (1=different, 0=same, at region level) |
| LIVE_REGION_NOT_WORK_REGION | Flag if client's contact address does not match work address (1=different, 0=same, at region level) |
| REG_CITY_NOT_LIVE_CITY | Flag if client's permanent address does not match contact address (1=different, 0=same, at city level) |
| REG_CITY_NOT_WORK_CITY | Flag if client's permanent address does not match work address (1=different, 0=same, at city level) |
| LIVE_CITY_NOT_WORK_CITY | Flag if client's contact address does not match work address (1=different, 0=same, at city level) |
| ORGANIZATION_TYPE | Type of organization where client works |
| EXT_SOURCE_1 | Normalized score from external data source |
| EXT_SOURCE_2 | Normalized score from external data source |
| EXT_SOURCE_3 | Normalized score from external data source |
| APARTMENTS_AVG | Normalized information about building where the client lives, average apartment size, common area, etc. |
| BASEMENTAREA_AVG | Normalized information about building where the client lives, average basement area, common area, etc. |
| YEARS_BEGINEXPLUATATION_AVG | Normalized information about building where the client lives, average years of building exploitation, etc. |
| YEARS_BUILD_AVG | Normalized information about building where the client lives, average years since building construction, etc. |
| COMMONAREA_AVG | Normalized information about building where the client lives, average common area, etc. |
| ELEVATORS_AVG | Normalized information about building where the client lives, average number of elevators, etc. |
| ENTRANCES_AVG | Normalized information about building where the client lives, average number of entrances, etc. |
| FLOORSMAX_AVG | Normalized information about building where the client lives, average maximum number of floors, etc. |
| FLOORSMIN_AVG | Normalized information about building where the client lives, average minimum number of floors, etc. |
| LANDAREA_AVG | Normalized information about building where the client lives, average land area, etc. |
| LIVINGAPARTMENTS_AVG | Normalized information about building where the client lives, average living area of apartments, etc. |
| LIVINGAREA_AVG | Normalized information about building where the client lives, average living area, etc. |
| NONLIVINGAPARTMENTS_AVG | Normalized information about building where the client lives, average non-living area of apartments, etc. |
| NONLIVINGAREA_AVG | Normalized information about building where the client lives, average non-living area, etc. |
| APARTMENTS_MODE | Normalized information about building where the client lives, modus (most common) apartment size, common area, etc. |
| BASEMENTAREA_MODE | Normalized information about building where the client lives, modus (most common) basement area, common area, etc. |
| YEARS_BEGINEXPLUATATION_MODE | Normalized information about building |
| YEARS_BUILD_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| COMMONAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| ELEVATORS_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| ENTRANCES_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| FLOORSMAX_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| FLOORSMIN_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| LANDAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| LIVINGAPARTMENTS_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| LIVINGAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| NONLIVINGAPARTMENTS_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| NONLIVINGAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| APARTMENTS_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| BASEMENTAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| YEARS_BEGINEXPLUATATION_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| YEARS_BUILD_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| COMMONAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| ELEVATORS_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| ENTRANCES_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| FLOORSMAX_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| FLOORSMIN_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| LANDAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| LIVINGAPARTMENTS_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| LIVINGAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| NONLIVINGAPARTMENTS_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| NONLIVINGAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor |
| FLOORSMIN_MODE | Normalized information about building where the client lives, such as the mode (most common) value for the number of floors in the building |
| FLOORSMIN_MEDI | Normalized information about building where the client lives, such as the median value for the number of floors in the building |
| FLOORSMAX_MODE | Normalized information about building where the client lives, such as the mode (most common) value for the highest floor in the building |
| FLOORSMAX_MEDI | Normalized information about building where the client lives, such as the median value for the highest floor in the building |
| YEARS_BUILD_MODE | Normalized information about building where the client lives, such as the mode (most common) value for the age of the building |
| YEARS_BUILD_MEDI | Normalized information about building where the client lives, such as the median value for the age of the building |
| COMMONAREA_MODE | Normalized information about the common areas in the building where the client lives, such as the mode (most common) value for the common area size |
| COMMONAREA_MEDI | Normalized information about the common areas in the building where the client lives, such as the median value for the common area size |
| ELEVATORS_MODE | Normalized information about the elevators in the building where the client lives, such as the mode (most common) value for the number of elevators |
| ELEVATORS_MEDI | Normalized information about the elevators in the building where the client lives, such as the median value for the number of elevators |
| ENTRANCES_MODE | Normalized information about the entrances in the building where the client lives, such as the mode (most common) value for the number of entrances |
| ENTRANCES_MEDI | Normalized information about the entrances in the building where the client lives, such as the median value for the number of entrances |
| LANDAREA_MODE | Normalized information about the land area of the building where the client lives, such as the mode (most common) value for the land area |
| LANDAREA_MEDI | Normalized information about the land area of the building where the client lives, such as the median value for the land area |
| LIVINGAPARTMENTS_MODE | Normalized information about the living apartments in the building where the client lives, such as the mode (most common) value for the apartment size |
| LIVINGAPARTMENTS_MEDI | Normalized information about the living apartments in the building where the client lives, such as the median value for the apartment size |
| LIVINGAREA_MODE | Normalized information about the living area in the building where the client lives, such as the mode (most common) value for the living area size |
| LIVINGAREA_MEDI | Normalized information about the living area in the building where the client lives, such as the median value for the living area size |
| NONLIVINGAPARTMENTS_MODE | Normalized information about the non-living apartments in the building where the client lives, such as the mode (most common) value for the apartment size |
| NONLIVINGAPARTMENTS_MEDI | Normalized information about the non-living apartments in the building where the client lives, such as the median value for the apartment size |
| NONLIVINGAREA_MODE | Normalized information about the non-living area in the building where the client lives, such as the mode (most common) value for the non-living area size |
| NONLIVINGAREA_MEDI | Normalized information about the non-living area in the building where the client lives, such as the median value for the non-living area size |
| FONDKAPREMONT_MODE | Normalized information about the fund kapremont (capital repairs) mode for the building where the client lives |
| HOUSETYPE_MODE | Normalized information about the type of house in which the client lives |
| TOTALAREA_MODE | Normalized information about the total area of the building where the client lives, such as the mode (most common) value for the total area |
| WALLSMATERIAL_MODE | Normalized information about the wall material used in the building where the client lives |
| EMERGENCYSTATE_MODE | Normalized information about the emergency state of the building where the client lives |
| OBS_30_CNT_SOCIAL_CIRCLE | Number of observations of clients' social surroundings with observable 30 DPD (days past due) default |
| DEF_30_CNT_SOCIAL_CIRCLE | Number of clients' social surroundings with 30 DPD (days past due) default |
| OBS_60_CNT_SOCIAL_CIRCLE | Number of observations of clients' social surroundings with observable 60 DPD (days past due) default |
| DEF_60_CNT_SOCIAL_CIRCLE | Number of clients' social surroundings with 60 DPD (days past due) default |
| DAYS_LAST_PHONE_CHANGE | Number of days since the client last changed their phone number |
| FLAG_DOCUMENT_2 | Indicates if there is a second document provided by the client |
| FLAG_DOCUMENT_3 | Indicates if there is a third document provided by the client |
| FLAG_DOCUMENT_4 | Indicates if there is a fourth document provided by the client |
| FLAG_DOCUMENT_5 | Indicates if there is a fifth document provided by the client |
| FLAG_DOCUMENT_6 | Indicates if there is a sixth document provided by the client |
| FLAG_DOCUMENT_7 | Indicates if there is a seventh document provided by the client |
| FLAG_DOCUMENT_8 | Indicates if there is an eighth document provided by the client |
| FLAG_DOCUMENT_9 | Indicates if there is a ninth document provided by the client |
| FLAG_DOCUMENT_10 | Indicates if there is a tenth document provided by the client |
| FLAG_DOCUMENT_11 | Indicates if there is an eleventh document provided by the client |
| FLAG_DOCUMENT_12 | Indicates if there is a twelfth document provided by the client |
| FLAG_DOCUMENT_13 | Indicates if there is a thirteenth document provided by the client |
| FLAG_DOCUMENT_14 | Indicates if there is a fourteenth document provided by the client |
| FLAG_DOCUMENT_15 | Indicates if there is a fifteenth document provided by the client |
| FLAG_DOCUMENT_16 | Indicates if there is a sixteenth document provided by the client |
| FLAG_DOCUMENT_17 | Indicates if there is a seventeenth document provided by the client |
| FLAG_DOCUMENT_18 | Indicates if there is an eighteenth document provided by the client |
| FLAG_DOCUMENT_19 | Indicates if there is a nineteenth document provided by the client |
| FLAG_DOCUMENT_20 | Indicates if there is a twentieth document provided by the client |
| FLAG_DOCUMENT_21 | Indicates if there is a twenty-first document provided by the client |
| AMT_REQ_CREDIT_BUREAU_HOUR | Number of enquiries to Credit Bureau about the client one hour before application |
| AMT_REQ_CREDIT_BUREAU_DAY | Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application) |
| AMT_REQ_CREDIT_BUREAU_WEEK | Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application) |
| AMT_REQ_CREDIT_BUREAU_MON | Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application) |
| AMT_REQ_CREDIT_BUREAU_QRT | Number of enquiries to Credit Bureau about the client three months before application (excluding one month before application) |
| AMT_REQ_CREDIT_BUREAU_YEAR | Number of enquiries to Credit Bureau about the client one year (excluding last three months before application) |
application_test.csv:
| Column | Description |
|---|---|
| SK_ID_CURR | ID of loan in our sample - one loan in our sample can have 0, 1, 2 or more related previous credits in credit bureau |
| SK_BUREAU_ID | Recoded ID of previous Credit Bureau credit related to our loan (unique coding for each loan application) |
| CREDIT_ACTIVE | Status of the Credit Bureau (CB) reported credits |
| CREDIT_CURRENCY | Recoded currency of the Credit Bureau credit |
| DAYS_CREDIT | How many days before current application did client apply for Credit Bureau credit |
| CREDIT_DAY_OVERDUE | Number of days past due on CB credit at the time of application for related loan in our sample |
| DAYS_CREDIT_ENDDATE | Remaining duration of CB credit (in days) at the time of application in Home Credit |
| DAYS_ENDDATE_FACT | Days since CB credit ended at the time of application in Home Credit (only for closed credit) |
| AMT_CREDIT_MAX_OVERDUE | Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample) |
| CNT_CREDIT_PROLONG | How many times was the Credit Bureau credit prolonged |
| AMT_CREDIT_SUM | Current credit amount for the Credit Bureau credit |
| AMT_CREDIT_SUM_DEBT | Current debt on Credit Bureau credit |
| AMT_CREDIT_SUM_LIMIT | Current credit limit of credit card reported in Credit Bureau |
| AMT_CREDIT_SUM_OVERDUE | Current amount overdue on Credit Bureau credit |
| CREDIT_TYPE | Type of Credit Bureau credit (Car, cash, ...) |
| DAYS_CREDIT_UPDATE | How many days before loan application did last information about the Credit Bureau credit come |
| AMT_ANNUITY | Annuity of the Credit Bureau credit |
| Columns | Description |
|---|---|
| SK_BUREAU_ID | Recoded ID of Credit Bureau credit (unique coding for each application) - use this to join to CREDIT_BUREAU table |
| MONTHS_BALANCE | Month of balance relative to application date (-1 means the freshest balance date) |
| STATUS | Status of Credit Bureau loan during the month (active, closed, DPD0-30,… [C means closed, X means status unknown, 0 means no DPD, 1 means maximal did during month between 1-30, 2 means DPD 31-60,… 5 means DPD 120+ or sold or written off ]) |
| Columns | Description |
|---|---|
| SK_ID_PREV | ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0, 1, 2 or more previous loans in Home Credit) |
| SK_ID_CURR | ID of loan in our sample |
| MONTHS_BALANCE | Month of balance relative to application date (-1 means the freshest balance date) |
| AMT_BALANCE | Balance during the month of previous credit |
| AMT_CREDIT_LIMIT_ACTUAL | Credit card limit during the month of the previous credit |
| AMT_DRAWINGS_ATM_CURRENT | Amount drawing at ATM during the month of the previous credit |
| AMT_DRAWINGS_CURRENT | Amount drawing during the month of the previous credit |
| AMT_DRAWINGS_OTHER_CURRENT | Amount of other drawings during the month of the previous credit |
| AMT_DRAWINGS_POS_CURRENT | Amount drawing or buying goods during the month of the previous credit |
| AMT_INST_MIN_REGULARITY | Minimal installment for this month of the previous credit |
| AMT_PAYMENT_CURRENT | How much did the client pay during the month on the previous credit |
| AMT_PAYMENT_TOTAL_CURRENT | How much did the client pay during the month in total on the previous credit |
| AMT_RECEIVABLE_PRINCIPAL | Amount receivable for principal on the previous credit |
| AMT_RECIVABLE | Amount receivable on the previous credit |
| AMT_TOTAL_RECEIVABLE | Total amount receivable on the previous credit |
| CNT_DRAWINGS_ATM_CURRENT | Number of drawings at ATM during this month on the previous credit |
| CNT_DRAWINGS_CURRENT | Number of drawings during this month on the previous credit |
| CNT_DRAWINGS_OTHER_CURRENT | Number of other drawings during this month on the previous credit |
| CNT_DRAWINGS_POS_CURRENT | Number of drawings for goods during this month on the previous credit |
| CNT_INSTALMENT_MATURE_CUM | Number of paid installments on the previous credit |
| NAME_CONTRACT_STATUS | Contract status (active signed, ...) on the previous credit |
| SK_DPD | DPD (Days past due) during the month on the previous credit |
| SK_DPD_DEF | DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit |
| Columns | Description |
|---|---|
| SK_ID_PREV | ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0, 1, 2 or more previous loans in Home Credit) |
| SK_ID_CURR | ID of loan in our sample |
| NUM_INSTALMENT_VERSION | Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed |
| NUM_INSTALMENT_NUMBER | On which installment we observe payment |
| DAYS_INSTALMENT | When the installment of previous credit was supposed to be paid (relative to application date of current loan) |
| DAYS_ENTRY_PAYMENT | When the installments of previous credit were actually paid (relative to application date of current loan) |
| AMT_INSTALMENT | What was the prescribed installment amount of previous credit on this installment |
| AMT_PAYMENT | What the client actually paid on the previous credit on this installment |
| Columns | Description |
|---|---|
| SK_ID_PREV | ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0, 1, 2, or more previous loans in Home Credit) |
| SK_ID_CURR | ID of loan in our sample |
| MONTHS_BALANCE | Month of balance relative to application date (-1 means the freshest monthly snapshot, 0 means the information at application) |
| CNT_INSTALMENT | Term of previous credit (can change over time) |
| CNT_INSTALMENT_FUTURE | Installments left to pay on the previous credit |
| NAME_CONTRACT_STATUS | Contract status during the month |
| SK_DPD | DPD (days past due) during the month of previous credit |
| SK_DPD_DEF | DPD during the month with tolerance (debts with low loan amounts are ignored) |
| Columns | Description |
|---|---|
| SK_ID_PREV | ID of previous credit in Home Credit related to loan in our sample |
| SK_ID_CURR | ID of loan in our sample |
| NAME_CONTRACT_TYPE | Contract product type (Cash loan, consumer loan [POS], ...) of the previous application |
| AMT_ANNUITY | Annuity of previous application |
| AMT_APPLICATION | For how much credit did the client ask on the previous application |
| AMT_CREDIT | Final credit amount on the previous application |
| AMT_DOWN_PAYMENT | Down payment on the previous application |
| AMT_GOODS_PRICE | Goods price of good that the client asked for (if applicable) on the previous application |
| WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply for the previous application |
| HOUR_APPR_PROCESS_START | Approximately at what hour of the day did the client apply for the previous application |
| FLAG_LAST_APPL_PER_CONTRACT | Flag if it was the last application for the previous contract |
| NFLAG_LAST_APPL_IN_DAY | Flag if the application was the last application per day of the client |
| NFLAG_MICRO_CASH | Flag for Micro finance loan |
| RATE_DOWN_PAYMENT | Down payment rate normalized on the previous credit |
| RATE_INTEREST_PRIMARY | Interest rate normalized on the previous credit |
| RATE_INTEREST_PRIVILEGED | Interest rate normalized on the previous credit |
| NAME_CASH_LOAN_PURPOSE | Purpose of the cash loan |
| NAME_CONTRACT_STATUS | Contract status (approved, cancelled, ...) of the previous application |
| DAYS_DECISION | Relative to the current application, when was the decision about the previous application made |
| NAME_PAYMENT_TYPE | Payment method that the client chose to pay for the previous application |
| CODE_REJECT_REASON | Reason why the previous application was rejected |
| NAME_TYPE_SUITE | Who accompanied the client when applying for the previous application |
| NAME_CLIENT_TYPE | Whether the client was an old or new client when applying for the previous application |
| NAME_GOODS_CATEGORY | What kind of goods did the client apply for in the previous application |
| NAME_PORTFOLIO | Whether the previous application was for CASH, POS, CAR, etc. |
| NAME_PRODUCT_TYPE | Whether the previous application was x-sell or walk-in |
| CHANNEL_TYPE | Through which channel we acquired the client on the previous application |
| SELLERPLACE_AREA | Selling area of the seller place of the previous application |
| NAME_SELLER_INDUSTRY | The industry of the seller |
| CNT_PAYMENT | Term of the previous credit at the application of the previous application |
| NAME_YIELD_GROUP | Grouped interest rate into small, medium, and high of the previous application |
| PRODUCT_COMBINATION | Detailed product combination of the previous application |
| DAYS_FIRST_DRAWING | Relative to the application date of the current application, when was the first disbursement of the previous application |
| DAYS_FIRST_DUE | Relative to the application date of the current application, when was the first due supposed to be of the previous application |
| DAYS_LAST_DUE_1ST_VERSION | Relative to the application date of the current application, when was the first due of the previous application |
| DAYS_LAST_DUE | Relative to the application date of the current application, when was the last due date of the previous application |
| DAYS_TERMINATION | Relative to the application date of the current application, when was the expected termination of the previous application |
| NFLAG_INSURED_ON_APPROVAL | Whether the client requested insurance during the previous application |
# !pip install pandas
# !pip install numpy
import pandas as pd
import numpy as np
# Making Dataframes of each of the CSV Files
applicationtestDF = pd.read_csv("application_test.csv")
applicationtrainDF = pd.read_csv("application_train.csv")
bureauDF = pd.read_csv("bureau.csv")
bureaubalanceDF = pd.read_csv("bureau_balance.csv")
creditcardbalanceDF = pd.read_csv("credit_card_balance.csv")
installmentpaymentsDF = pd.read_csv("installments_payments.csv")
poscashDF = pd.read_csv("POS_CASH_balance.csv")
previousDF = pd.read_csv("previous_application.csv")
# Printing the shape of each of the dataframes
dataframes = [applicationtestDF, applicationtrainDF, bureauDF, bureaubalanceDF, creditcardbalanceDF, installmentpaymentsDF, poscashDF, previousDF]
for df in dataframes:
print(df.shape)
(48744, 121) (307511, 122) (1716428, 17) (27299925, 3) (3840312, 23) (13605401, 8) (10001358, 8) (1670214, 37)
# Printing the head of each of the dataframes
dataframes = [applicationtestDF, applicationtrainDF, bureauDF, bureaubalanceDF, creditcardbalanceDF, installmentpaymentsDF, poscashDF, previousDF]
for df in dataframes:
print(df.head(n=5))
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY \
0 100001 Cash loans F N Y
1 100005 Cash loans M N Y
2 100013 Cash loans M Y Y
3 100028 Cash loans F N Y
4 100038 Cash loans M Y N
CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE \
0 0 135000.0 568800.0 20560.5 450000.0
1 0 99000.0 222768.0 17370.0 180000.0
2 0 202500.0 663264.0 69777.0 630000.0
3 2 315000.0 1575000.0 49018.5 1575000.0
4 1 180000.0 625500.0 32067.0 625500.0
... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
0 ... 0 0 0 0
1 ... 0 0 0 0
2 ... 0 0 0 0
3 ... 0 0 0 0
4 ... 0 0 0 0
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 NaN NaN
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 NaN NaN
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 0.0 0.0
1 0.0 3.0
2 1.0 4.0
3 0.0 3.0
4 NaN NaN
[5 rows x 121 columns]
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR \
0 100002 1 Cash loans M N
1 100003 0 Cash loans F N
2 100004 0 Revolving loans M Y
3 100006 0 Cash loans F N
4 100007 0 Cash loans M N
FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY \
0 Y 0 202500.0 406597.5 24700.5
1 N 0 270000.0 1293502.5 35698.5
2 Y 0 67500.0 135000.0 6750.0
3 Y 0 135000.0 312682.5 29686.5
4 Y 0 121500.0 513000.0 21865.5
... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
0 ... 0 0 0 0
1 ... 0 0 0 0
2 ... 0 0 0 0
3 ... 0 0 0 0
4 ... 0 0 0 0
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 NaN NaN
4 0.0 0.0
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 NaN NaN
4 0.0 0.0
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 0.0 1.0
1 0.0 0.0
2 0.0 0.0
3 NaN NaN
4 0.0 0.0
[5 rows x 122 columns]
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT \
0 215354 5714462 Closed currency 1 -497
1 215354 5714463 Active currency 1 -208
2 215354 5714464 Active currency 1 -203
3 215354 5714465 Active currency 1 -203
4 215354 5714466 Active currency 1 -629
CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT \
0 0 -153.0 -153.0
1 0 1075.0 NaN
2 0 528.0 NaN
3 0 NaN NaN
4 0 1197.0 NaN
AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM \
0 NaN 0 91323.0
1 NaN 0 225000.0
2 NaN 0 464323.5
3 NaN 0 90000.0
4 77674.5 0 2700000.0
AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE \
0 0.0 NaN 0.0
1 171342.0 NaN 0.0
2 NaN NaN 0.0
3 NaN NaN 0.0
4 NaN NaN 0.0
CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 Consumer credit -131 NaN
1 Credit card -20 NaN
2 Consumer credit -16 NaN
3 Credit card -16 NaN
4 Consumer credit -21 NaN
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE \
0 2562384 378907 -6 56.970
1 2582071 363914 -1 63975.555
2 1740877 371185 -7 31815.225
3 1389973 337855 -4 236572.110
4 1891521 126868 -1 453919.455
AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT \
0 135000 0.0 877.5
1 45000 2250.0 2250.0
2 450000 0.0 0.0
3 225000 2250.0 2250.0
4 450000 0.0 11547.0
AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT \
0 0.0 877.5
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 11547.0
AMT_INST_MIN_REGULARITY ... AMT_RECIVABLE AMT_TOTAL_RECEIVABLE \
0 1700.325 ... 0.000 0.000
1 2250.000 ... 64875.555 64875.555
2 2250.000 ... 31460.085 31460.085
3 11795.760 ... 233048.970 233048.970
4 22924.890 ... 453919.455 453919.455
CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT \
0 0.0 1 0.0
1 1.0 1 0.0
2 0.0 0 0.0
3 1.0 1 0.0
4 0.0 1 0.0
CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS \
0 1.0 35.0 Active
1 0.0 69.0 Active
2 0.0 30.0 Active
3 0.0 10.0 Active
4 1.0 101.0 Active
SK_DPD SK_DPD_DEF
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
[5 rows x 23 columns]
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER \
0 1054186 161674 1.0 6
1 1330831 151639 0.0 34
2 2085231 193053 2.0 1
3 2452527 199697 1.0 3
4 2714724 167756 1.0 2
DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 -1180.0 -1187.0 6948.360 6948.360
1 -2156.0 -2156.0 1716.525 1716.525
2 -63.0 -63.0 25425.000 25425.000
3 -2418.0 -2426.0 24350.130 24350.130
4 -1383.0 -1366.0 2165.040 2160.585
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT \
0 1803195 182943 -31 48.0
1 1715348 367990 -33 36.0
2 1784872 397406 -32 12.0
3 1903291 269225 -35 48.0
4 2341044 334279 -35 36.0
CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 45.0 Active 0 0
1 35.0 Active 0 0
2 9.0 Active 0 0
3 42.0 Active 0 0
4 35.0 Active 0 0
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION \
0 2030495 271877 Consumer loans 1730.430 17145.0
1 2802425 108129 Cash loans 25188.615 607500.0
2 2523466 122040 Cash loans 15060.735 112500.0
3 2819243 176158 Cash loans 47041.335 450000.0
4 1784265 202054 Cash loans 31924.395 337500.0
AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START \
0 17145.0 0.0 17145.0 SATURDAY
1 679671.0 NaN 607500.0 THURSDAY
2 136444.5 NaN 112500.0 TUESDAY
3 470790.0 NaN 450000.0 MONDAY
4 404055.0 NaN 337500.0 THURSDAY
HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT \
0 15 ... Connectivity 12.0
1 11 ... XNA 36.0
2 11 ... XNA 12.0
3 7 ... XNA 12.0
4 9 ... XNA 24.0
NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING \
0 middle POS mobile with interest 365243.0
1 low_action Cash X-Sell: low 365243.0
2 high Cash X-Sell: high 365243.0
3 middle Cash X-Sell: middle 365243.0
4 high Cash Street: high NaN
DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION \
0 -42.0 300.0 -42.0 -37.0
1 -134.0 916.0 365243.0 365243.0
2 -271.0 59.0 365243.0 365243.0
3 -482.0 -152.0 -182.0 -177.0
4 NaN NaN NaN NaN
NFLAG_INSURED_ON_APPROVAL
0 0.0
1 1.0
2 1.0
3 1.0
4 NaN
[5 rows x 37 columns]
# Printing out the common columns amongst the dataframes
datasets = [applicationtrainDF, bureauDF, bureaubalanceDF, creditcardbalanceDF, installmentpaymentsDF, poscashDF, previousDF]
# Get the column names from each DataFrame
columns = [set(df.columns.str.strip().str.lower().tolist()) for df in dataframes]
# Initialize a list to store the common columns for each comparison
common_columns = []
# Iterate over the dataframes
for i in range(len(dataframes)):
for j in range(i + 1, len(dataframes)):
df1 = dataframes[i]
df2 = dataframes[j]
# Get the column names for the current pair of dataframes
columns_df1 = columns[i]
columns_df2 = columns[j]
# Find the common columns between the two dataframes
common = columns_df1.intersection(columns_df2)
# Append the common columns for the current pair
common_columns.append((common))
# Print the common column names for each pair of dataframes
for common in common_columns:
if len(common) == 0:
print("No common columns found.")
else:
print(", ".join(common))
flag_document_18, flag_email, cnt_fam_members, flag_document_13, name_income_type, name_education_type, def_30_cnt_social_circle, sk_id_curr, flag_document_12, years_build_avg, entrances_avg, wallsmaterial_mode, years_build_mode, reg_region_not_live_region, region_population_relative, fondkapremont_mode, years_beginexpluatation_medi, housetype_mode, obs_30_cnt_social_circle, totalarea_mode, live_city_not_work_city, ext_source_3, days_last_phone_change, flag_emp_phone, flag_document_21, amt_req_credit_bureau_year, reg_city_not_work_city, flag_document_11, flag_cont_mobile, region_rating_client, amt_goods_price, region_rating_client_w_city, flag_document_17, flag_document_20, nonlivingapartments_mode, years_build_medi, flag_mobil, floorsmin_medi, days_employed, nonlivingapartments_avg, flag_document_7, name_housing_type, flag_document_3, name_contract_type, elevators_medi, own_car_age, elevators_avg, nonlivingarea_avg, years_beginexpluatation_mode, entrances_medi, nonlivingapartments_medi, cnt_children, occupation_type, livingapartments_mode, organization_type, livingapartments_avg, amt_req_credit_bureau_week, basementarea_avg, years_beginexpluatation_avg, emergencystate_mode, flag_document_14, days_birth, days_registration, apartments_mode, apartments_avg, floorsmax_avg, floorsmax_mode, flag_document_8, flag_document_9, amt_req_credit_bureau_hour, livingarea_avg, ext_source_1, obs_60_cnt_social_circle, flag_phone, flag_own_car, floorsmax_medi, flag_work_phone, apartments_medi, days_id_publish, flag_document_4, nonlivingarea_medi, amt_annuity, flag_document_16, flag_document_6, code_gender, basementarea_mode, entrances_mode, name_type_suite, amt_req_credit_bureau_qrt, landarea_avg, commonarea_mode, commonarea_medi, flag_document_2, weekday_appr_process_start, floorsmin_avg, floorsmin_mode, amt_req_credit_bureau_mon, live_region_not_work_region, reg_region_not_work_region, amt_income_total, amt_credit, landarea_medi, elevators_mode, landarea_mode, livingapartments_medi, flag_document_5, hour_appr_process_start, flag_document_10, def_60_cnt_social_circle, livingarea_mode, flag_own_realty, nonlivingarea_mode, flag_document_15, ext_source_2, basementarea_medi, amt_req_credit_bureau_day, reg_city_not_live_city, flag_document_19, livingarea_medi, name_family_status, commonarea_avg sk_id_curr, amt_annuity No common columns found. sk_id_curr sk_id_curr sk_id_curr sk_id_curr, name_contract_type, name_type_suite, hour_appr_process_start, amt_annuity, amt_goods_price, weekday_appr_process_start, amt_credit sk_id_curr, amt_annuity No common columns found. sk_id_curr sk_id_curr sk_id_curr sk_id_curr, name_contract_type, name_type_suite, hour_appr_process_start, amt_annuity, amt_goods_price, weekday_appr_process_start, amt_credit sk_id_bureau sk_id_curr sk_id_curr sk_id_curr sk_id_curr, amt_annuity months_balance No common columns found. months_balance No common columns found. sk_id_prev, sk_id_curr sk_id_prev, sk_id_curr, sk_dpd_def, months_balance, sk_dpd, name_contract_status sk_id_prev, sk_id_curr, name_contract_status sk_id_prev, sk_id_curr sk_id_prev, sk_id_curr sk_id_prev, sk_id_curr, name_contract_status
# Printing info about each of the dataframes
dataframes = [applicationtestDF, applicationtrainDF, bureauDF, bureaubalanceDF, creditcardbalanceDF, installmentpaymentsDF, poscashDF, previousDF]
for df in dataframes:
display(df.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB
None
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
None
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB
None
<class 'pandas.core.frame.DataFrame'> RangeIndex: 27299925 entries, 0 to 27299924 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE int64 2 STATUS object dtypes: int64(2), object(1) memory usage: 624.8+ MB
None
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB
None
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13605401 entries, 0 to 13605400 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 NUM_INSTALMENT_VERSION float64 3 NUM_INSTALMENT_NUMBER int64 4 DAYS_INSTALMENT float64 5 DAYS_ENTRY_PAYMENT float64 6 AMT_INSTALMENT float64 7 AMT_PAYMENT float64 dtypes: float64(5), int64(3) memory usage: 830.4 MB
None
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10001358 entries, 0 to 10001357 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 CNT_INSTALMENT float64 4 CNT_INSTALMENT_FUTURE float64 5 NAME_CONTRACT_STATUS object 6 SK_DPD int64 7 SK_DPD_DEF int64 dtypes: float64(2), int64(5), object(1) memory usage: 610.4+ MB
None
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB
None
# Get summary statistics of the numerical columns
dataframes = [applicationtestDF, applicationtrainDF, bureauDF, bureaubalanceDF, creditcardbalanceDF, installmentpaymentsDF, poscashDF, previousDF]
for df in dataframes:
print(df.describe())
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT \
count 48744.000000 48744.000000 4.874400e+04 4.874400e+04
mean 277796.676350 0.397054 1.784318e+05 5.167404e+05
std 103169.547296 0.709047 1.015226e+05 3.653970e+05
min 100001.000000 0.000000 2.694150e+04 4.500000e+04
25% 188557.750000 0.000000 1.125000e+05 2.606400e+05
50% 277549.000000 0.000000 1.575000e+05 4.500000e+05
75% 367555.500000 1.000000 2.250000e+05 6.750000e+05
max 456250.000000 20.000000 4.410000e+06 2.245500e+06
AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE \
count 48720.000000 4.874400e+04 48744.000000
mean 29426.240209 4.626188e+05 0.021226
std 16016.368315 3.367102e+05 0.014428
min 2295.000000 4.500000e+04 0.000253
25% 17973.000000 2.250000e+05 0.010006
50% 26199.000000 3.960000e+05 0.018850
75% 37390.500000 6.300000e+05 0.028663
max 180576.000000 2.245500e+06 0.072508
DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION ... FLAG_DOCUMENT_18 \
count 48744.000000 48744.000000 48744.000000 ... 48744.000000
mean -16068.084605 67485.366322 -4967.652716 ... 0.001559
std 4325.900393 144348.507136 3552.612035 ... 0.039456
min -25195.000000 -17463.000000 -23722.000000 ... 0.000000
25% -19637.000000 -2910.000000 -7459.250000 ... 0.000000
50% -15785.000000 -1293.000000 -4490.000000 ... 0.000000
75% -12496.000000 -296.000000 -1901.000000 ... 0.000000
max -7338.000000 365243.000000 0.000000 ... 1.000000
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
count 48744.0 48744.0 48744.0
mean 0.0 0.0 0.0
std 0.0 0.0 0.0
min 0.0 0.0 0.0
25% 0.0 0.0 0.0
50% 0.0 0.0 0.0
75% 0.0 0.0 0.0
max 0.0 0.0 0.0
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY \
count 42695.000000 42695.000000
mean 0.002108 0.001803
std 0.046373 0.046132
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 2.000000 2.000000
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON \
count 42695.000000 42695.000000
mean 0.002787 0.009299
std 0.054037 0.110924
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 2.000000 6.000000
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 42695.000000 42695.000000
mean 0.546902 1.983769
std 0.693305 1.838873
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 2.000000
75% 1.000000 3.000000
max 7.000000 17.000000
[8 rows x 105 columns]
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL \
count 307511.000000 307511.000000 307511.000000 3.075110e+05
mean 278180.518577 0.080729 0.417052 1.687979e+05
std 102790.175348 0.272419 0.722121 2.371231e+05
min 100002.000000 0.000000 0.000000 2.565000e+04
25% 189145.500000 0.000000 0.000000 1.125000e+05
50% 278202.000000 0.000000 0.000000 1.471500e+05
75% 367142.500000 0.000000 1.000000 2.025000e+05
max 456255.000000 1.000000 19.000000 1.170000e+08
AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE \
count 3.075110e+05 307499.000000 3.072330e+05
mean 5.990260e+05 27108.573909 5.383962e+05
std 4.024908e+05 14493.737315 3.694465e+05
min 4.500000e+04 1615.500000 4.050000e+04
25% 2.700000e+05 16524.000000 2.385000e+05
50% 5.135310e+05 24903.000000 4.500000e+05
75% 8.086500e+05 34596.000000 6.795000e+05
max 4.050000e+06 258025.500000 4.050000e+06
REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED ... \
count 307511.000000 307511.000000 307511.000000 ...
mean 0.020868 -16036.995067 63815.045904 ...
std 0.013831 4363.988632 141275.766519 ...
min 0.000290 -25229.000000 -17912.000000 ...
25% 0.010006 -19682.000000 -2760.000000 ...
50% 0.018850 -15750.000000 -1213.000000 ...
75% 0.028663 -12413.000000 -289.000000 ...
max 0.072508 -7489.000000 365243.000000 ...
FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
count 307511.000000 307511.000000 307511.000000 307511.000000
mean 0.008130 0.000595 0.000507 0.000335
std 0.089798 0.024387 0.022518 0.018299
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY \
count 265992.000000 265992.000000
mean 0.006402 0.007000
std 0.083849 0.110757
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 4.000000 9.000000
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON \
count 265992.000000 265992.000000
mean 0.034362 0.267395
std 0.204685 0.916002
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 8.000000 27.000000
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 265992.000000 265992.000000
mean 0.265474 1.899974
std 0.794056 1.869295
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 1.000000
75% 0.000000 3.000000
max 261.000000 25.000000
[8 rows x 106 columns]
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE \
count 1.716428e+06 1.716428e+06 1.716428e+06 1.716428e+06
mean 2.782149e+05 5.924434e+06 -1.142108e+03 8.181666e-01
std 1.029386e+05 5.322657e+05 7.951649e+02 3.654443e+01
min 1.000010e+05 5.000000e+06 -2.922000e+03 0.000000e+00
25% 1.888668e+05 5.463954e+06 -1.666000e+03 0.000000e+00
50% 2.780550e+05 5.926304e+06 -9.870000e+02 0.000000e+00
75% 3.674260e+05 6.385681e+06 -4.740000e+02 0.000000e+00
max 4.562550e+05 6.843457e+06 0.000000e+00 2.792000e+03
DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE \
count 1.610875e+06 1.082775e+06 5.919400e+05
mean 5.105174e+02 -1.017437e+03 3.825418e+03
std 4.994220e+03 7.140106e+02 2.060316e+05
min -4.206000e+04 -4.202300e+04 0.000000e+00
25% -1.138000e+03 -1.489000e+03 0.000000e+00
50% -3.300000e+02 -8.970000e+02 0.000000e+00
75% 4.740000e+02 -4.250000e+02 0.000000e+00
max 3.119900e+04 0.000000e+00 1.159872e+08
CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT \
count 1.716428e+06 1.716415e+06 1.458759e+06
mean 6.410406e-03 3.549946e+05 1.370851e+05
std 9.622391e-02 1.149811e+06 6.774011e+05
min 0.000000e+00 0.000000e+00 -4.705600e+06
25% 0.000000e+00 5.130000e+04 0.000000e+00
50% 0.000000e+00 1.255185e+05 0.000000e+00
75% 0.000000e+00 3.150000e+05 4.015350e+04
max 9.000000e+00 5.850000e+08 1.701000e+08
AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE DAYS_CREDIT_UPDATE \
count 1.124648e+06 1.716428e+06 1.716428e+06
mean 6.229515e+03 3.791276e+01 -5.937483e+02
std 4.503203e+04 5.937650e+03 7.207473e+02
min -5.864061e+05 0.000000e+00 -4.194700e+04
25% 0.000000e+00 0.000000e+00 -9.080000e+02
50% 0.000000e+00 0.000000e+00 -3.950000e+02
75% 0.000000e+00 0.000000e+00 -3.300000e+01
max 4.705600e+06 3.756681e+06 3.720000e+02
AMT_ANNUITY
count 4.896370e+05
mean 1.571276e+04
std 3.258269e+05
min 0.000000e+00
25% 0.000000e+00
50% 0.000000e+00
75% 1.350000e+04
max 1.184534e+08
SK_ID_BUREAU MONTHS_BALANCE
count 2.729992e+07 2.729992e+07
mean 6.036297e+06 -3.074169e+01
std 4.923489e+05 2.386451e+01
min 5.001709e+06 -9.600000e+01
25% 5.730933e+06 -4.600000e+01
50% 6.070821e+06 -2.500000e+01
75% 6.431951e+06 -1.100000e+01
max 6.842888e+06 0.000000e+00
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE \
count 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06
mean 1.904504e+06 2.783242e+05 -3.452192e+01 5.830016e+04
std 5.364695e+05 1.027045e+05 2.666775e+01 1.063070e+05
min 1.000018e+06 1.000060e+05 -9.600000e+01 -4.202502e+05
25% 1.434385e+06 1.895170e+05 -5.500000e+01 0.000000e+00
50% 1.897122e+06 2.783960e+05 -2.800000e+01 0.000000e+00
75% 2.369328e+06 3.675800e+05 -1.100000e+01 8.904669e+04
max 2.843496e+06 4.562500e+05 -1.000000e+00 1.505902e+06
AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT \
count 3.840312e+06 3.090496e+06
mean 1.538080e+05 5.961325e+03
std 1.651457e+05 2.822569e+04
min 0.000000e+00 -6.827310e+03
25% 4.500000e+04 0.000000e+00
50% 1.125000e+05 0.000000e+00
75% 1.800000e+05 0.000000e+00
max 1.350000e+06 2.115000e+06
AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT \
count 3.840312e+06 3.090496e+06
mean 7.433388e+03 2.881696e+02
std 3.384608e+04 8.201989e+03
min -6.211620e+03 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 2.287098e+06 1.529847e+06
AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... \
count 3.090496e+06 3.535076e+06 ...
mean 2.968805e+03 3.540204e+03 ...
std 2.079689e+04 5.600154e+03 ...
min 0.000000e+00 0.000000e+00 ...
25% 0.000000e+00 0.000000e+00 ...
50% 0.000000e+00 0.000000e+00 ...
75% 0.000000e+00 6.633911e+03 ...
max 2.239274e+06 2.028820e+05 ...
AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE \
count 3.840312e+06 3.840312e+06 3.840312e+06
mean 5.596588e+04 5.808881e+04 5.809829e+04
std 1.025336e+05 1.059654e+05 1.059718e+05
min -4.233058e+05 -4.202502e+05 -4.202502e+05
25% 0.000000e+00 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00 0.000000e+00
75% 8.535924e+04 8.889949e+04 8.891451e+04
max 1.472317e+06 1.493338e+06 1.493338e+06
CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT \
count 3.090496e+06 3.840312e+06
mean 3.094490e-01 7.031439e-01
std 1.100401e+00 3.190347e+00
min 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 5.100000e+01 1.650000e+02
CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT \
count 3.090496e+06 3.090496e+06
mean 4.812496e-03 5.594791e-01
std 8.263861e-02 3.240649e+00
min 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 1.200000e+01 1.650000e+02
CNT_INSTALMENT_MATURE_CUM SK_DPD SK_DPD_DEF
count 3.535076e+06 3.840312e+06 3.840312e+06
mean 2.082508e+01 9.283667e+00 3.316220e-01
std 2.005149e+01 9.751570e+01 2.147923e+01
min 0.000000e+00 0.000000e+00 0.000000e+00
25% 4.000000e+00 0.000000e+00 0.000000e+00
50% 1.500000e+01 0.000000e+00 0.000000e+00
75% 3.200000e+01 0.000000e+00 0.000000e+00
max 1.200000e+02 3.260000e+03 3.260000e+03
[8 rows x 22 columns]
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION \
count 1.360540e+07 1.360540e+07 1.360540e+07
mean 1.903365e+06 2.784449e+05 8.566373e-01
std 5.362029e+05 1.027183e+05 1.035216e+00
min 1.000001e+06 1.000010e+05 0.000000e+00
25% 1.434191e+06 1.896390e+05 0.000000e+00
50% 1.896520e+06 2.786850e+05 1.000000e+00
75% 2.369094e+06 3.675300e+05 1.000000e+00
max 2.843499e+06 4.562550e+05 1.780000e+02
NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT \
count 1.360540e+07 1.360540e+07 1.360250e+07
mean 1.887090e+01 -1.042270e+03 -1.051114e+03
std 2.666407e+01 8.009463e+02 8.005859e+02
min 1.000000e+00 -2.922000e+03 -4.921000e+03
25% 4.000000e+00 -1.654000e+03 -1.662000e+03
50% 8.000000e+00 -8.180000e+02 -8.270000e+02
75% 1.900000e+01 -3.610000e+02 -3.700000e+02
max 2.770000e+02 -1.000000e+00 -1.000000e+00
AMT_INSTALMENT AMT_PAYMENT
count 1.360540e+07 1.360250e+07
mean 1.705091e+04 1.723822e+04
std 5.057025e+04 5.473578e+04
min 0.000000e+00 0.000000e+00
25% 4.226085e+03 3.398265e+03
50% 8.884080e+03 8.125515e+03
75% 1.671021e+04 1.610842e+04
max 3.771488e+06 3.771488e+06
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT \
count 1.000136e+07 1.000136e+07 1.000136e+07 9.975287e+06
mean 1.903217e+06 2.784039e+05 -3.501259e+01 1.708965e+01
std 5.358465e+05 1.027637e+05 2.606657e+01 1.199506e+01
min 1.000001e+06 1.000010e+05 -9.600000e+01 1.000000e+00
25% 1.434405e+06 1.895500e+05 -5.400000e+01 1.000000e+01
50% 1.896565e+06 2.786540e+05 -2.800000e+01 1.200000e+01
75% 2.368963e+06 3.674290e+05 -1.300000e+01 2.400000e+01
max 2.843499e+06 4.562550e+05 -1.000000e+00 9.200000e+01
CNT_INSTALMENT_FUTURE SK_DPD SK_DPD_DEF
count 9.975271e+06 1.000136e+07 1.000136e+07
mean 1.048384e+01 1.160693e+01 6.544684e-01
std 1.110906e+01 1.327140e+02 3.276249e+01
min 0.000000e+00 0.000000e+00 0.000000e+00
25% 3.000000e+00 0.000000e+00 0.000000e+00
50% 7.000000e+00 0.000000e+00 0.000000e+00
75% 1.400000e+01 0.000000e+00 0.000000e+00
max 8.500000e+01 4.231000e+03 3.595000e+03
SK_ID_PREV SK_ID_CURR AMT_ANNUITY AMT_APPLICATION \
count 1.670214e+06 1.670214e+06 1.297979e+06 1.670214e+06
mean 1.923089e+06 2.783572e+05 1.595512e+04 1.752339e+05
std 5.325980e+05 1.028148e+05 1.478214e+04 2.927798e+05
min 1.000001e+06 1.000010e+05 0.000000e+00 0.000000e+00
25% 1.461857e+06 1.893290e+05 6.321780e+03 1.872000e+04
50% 1.923110e+06 2.787145e+05 1.125000e+04 7.104600e+04
75% 2.384280e+06 3.675140e+05 2.065842e+04 1.803600e+05
max 2.845382e+06 4.562550e+05 4.180581e+05 6.905160e+06
AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE \
count 1.670213e+06 7.743700e+05 1.284699e+06
mean 1.961140e+05 6.697402e+03 2.278473e+05
std 3.185746e+05 2.092150e+04 3.153966e+05
min 0.000000e+00 -9.000000e-01 0.000000e+00
25% 2.416050e+04 0.000000e+00 5.084100e+04
50% 8.054100e+04 1.638000e+03 1.123200e+05
75% 2.164185e+05 7.740000e+03 2.340000e+05
max 6.905160e+06 3.060045e+06 6.905160e+06
HOUR_APPR_PROCESS_START NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT \
count 1.670214e+06 1.670214e+06 774370.000000
mean 1.248418e+01 9.964675e-01 0.079637
std 3.334028e+00 5.932963e-02 0.107823
min 0.000000e+00 0.000000e+00 -0.000015
25% 1.000000e+01 1.000000e+00 0.000000
50% 1.200000e+01 1.000000e+00 0.051605
75% 1.500000e+01 1.000000e+00 0.108909
max 2.300000e+01 1.000000e+00 1.000000
... RATE_INTEREST_PRIVILEGED DAYS_DECISION SELLERPLACE_AREA \
count ... 5951.000000 1.670214e+06 1.670214e+06
mean ... 0.773503 -8.806797e+02 3.139511e+02
std ... 0.100879 7.790997e+02 7.127443e+03
min ... 0.373150 -2.922000e+03 -1.000000e+00
25% ... 0.715645 -1.300000e+03 -1.000000e+00
50% ... 0.835095 -5.810000e+02 3.000000e+00
75% ... 0.852537 -2.800000e+02 8.200000e+01
max ... 1.000000 -1.000000e+00 4.000000e+06
CNT_PAYMENT DAYS_FIRST_DRAWING DAYS_FIRST_DUE \
count 1.297984e+06 997149.000000 997149.000000
mean 1.605408e+01 342209.855039 13826.269337
std 1.456729e+01 88916.115833 72444.869708
min 0.000000e+00 -2922.000000 -2892.000000
25% 6.000000e+00 365243.000000 -1628.000000
50% 1.200000e+01 365243.000000 -831.000000
75% 2.400000e+01 365243.000000 -411.000000
max 8.400000e+01 365243.000000 365243.000000
DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION \
count 997149.000000 997149.000000 997149.000000
mean 33767.774054 76582.403064 81992.343838
std 106857.034789 149647.415123 153303.516729
min -2801.000000 -2889.000000 -2874.000000
25% -1242.000000 -1314.000000 -1270.000000
50% -361.000000 -537.000000 -499.000000
75% 129.000000 -74.000000 -44.000000
max 365243.000000 365243.000000 365243.000000
NFLAG_INSURED_ON_APPROVAL
count 997149.000000
mean 0.332570
std 0.471134
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000
[8 rows x 21 columns]
# Getting information about the categorical columns
# installmentpaymentsDF does not include any categorical data so we do not include it.
dataframes = [applicationtestDF, applicationtrainDF, bureauDF, bureaubalanceDF, creditcardbalanceDF, poscashDF, previousDF]
for df in dataframes:
print(df.describe(include='object'))
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY \
count 48744 48744 48744 48744
unique 2 2 2 2
top Cash loans F N Y
freq 48305 32678 32311 33658
NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE \
count 47833 48744 48744
unique 7 7 5
top Unaccompanied Working Secondary / secondary special
freq 39727 24533 33988
NAME_FAMILY_STATUS NAME_HOUSING_TYPE OCCUPATION_TYPE \
count 48744 48744 33139
unique 5 6 18
top Married House / apartment Laborers
freq 32283 43645 8655
WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE FONDKAPREMONT_MODE \
count 48744 48744 15947
unique 7 58 4
top TUESDAY Business Entity Type 3 reg oper account
freq 9751 10840 12124
HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
count 25125 24851 26535
unique 3 7 2
top block of flats Panel No
freq 24659 11269 26179
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY \
count 307511 307511 307511 307511
unique 2 3 2 2
top Cash loans F N Y
freq 278232 202448 202924 213312
NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE \
count 306219 307511 307511
unique 7 8 5
top Unaccompanied Working Secondary / secondary special
freq 248526 158774 218391
NAME_FAMILY_STATUS NAME_HOUSING_TYPE OCCUPATION_TYPE \
count 307511 307511 211120
unique 6 6 18
top Married House / apartment Laborers
freq 196432 272868 55186
WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE FONDKAPREMONT_MODE \
count 307511 307511 97216
unique 7 58 4
top TUESDAY Business Entity Type 3 reg oper account
freq 53901 67992 73830
HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
count 153214 151170 161756
unique 3 7 2
top block of flats Panel No
freq 150503 66040 159428
CREDIT_ACTIVE CREDIT_CURRENCY CREDIT_TYPE
count 1716428 1716428 1716428
unique 4 4 15
top Closed currency 1 Consumer credit
freq 1079273 1715020 1251615
STATUS
count 27299925
unique 8
top C
freq 13646993
NAME_CONTRACT_STATUS
count 3840312
unique 7
top Active
freq 3698436
NAME_CONTRACT_STATUS
count 10001358
unique 9
top Active
freq 9151119
NAME_CONTRACT_TYPE WEEKDAY_APPR_PROCESS_START \
count 1670214 1670214
unique 4 7
top Cash loans TUESDAY
freq 747553 255118
FLAG_LAST_APPL_PER_CONTRACT NAME_CASH_LOAN_PURPOSE \
count 1670214 1670214
unique 2 25
top Y XAP
freq 1661739 922661
NAME_CONTRACT_STATUS NAME_PAYMENT_TYPE CODE_REJECT_REASON \
count 1670214 1670214 1670214
unique 4 4 9
top Approved Cash through the bank XAP
freq 1036781 1033552 1353093
NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO \
count 849809 1670214 1670214 1670214
unique 7 4 28 5
top Unaccompanied Repeater XNA POS
freq 508970 1231261 950809 691011
NAME_PRODUCT_TYPE CHANNEL_TYPE NAME_SELLER_INDUSTRY \
count 1670214 1670214 1670214
unique 3 8 11
top XNA Credit and cash offices XNA
freq 1063666 719968 855720
NAME_YIELD_GROUP PRODUCT_COMBINATION
count 1670214 1669868
unique 5 17
top XNA Cash
freq 517215 285990
# Get the unique values in each categorical column
def get_unique_values(data):
unique_values = {}
for column in data.select_dtypes(include='object'):
unique_values[column] = data[column].unique()
for column, values in unique_values.items():
print("\n")
print(f"{column}: {values}")
# Example usage with multiple DataFrames
get_unique_values(applicationtestDF)
get_unique_values(applicationtrainDF)
get_unique_values(bureauDF)
get_unique_values(bureaubalanceDF)
get_unique_values(creditcardbalanceDF)
get_unique_values(installmentpaymentsDF)
get_unique_values(poscashDF)
get_unique_values(previousDF)
NAME_CONTRACT_TYPE: ['Cash loans' 'Revolving loans'] CODE_GENDER: ['F' 'M'] FLAG_OWN_CAR: ['N' 'Y'] FLAG_OWN_REALTY: ['Y' 'N'] NAME_TYPE_SUITE: ['Unaccompanied' nan 'Family' 'Spouse, partner' 'Group of people' 'Other_B' 'Children' 'Other_A'] NAME_INCOME_TYPE: ['Working' 'State servant' 'Pensioner' 'Commercial associate' 'Businessman' 'Student' 'Unemployed'] NAME_EDUCATION_TYPE: ['Higher education' 'Secondary / secondary special' 'Incomplete higher' 'Lower secondary' 'Academic degree'] NAME_FAMILY_STATUS: ['Married' 'Single / not married' 'Civil marriage' 'Widow' 'Separated'] NAME_HOUSING_TYPE: ['House / apartment' 'With parents' 'Rented apartment' 'Municipal apartment' 'Office apartment' 'Co-op apartment'] OCCUPATION_TYPE: [nan 'Low-skill Laborers' 'Drivers' 'Sales staff' 'High skill tech staff' 'Core staff' 'Laborers' 'Managers' 'Accountants' 'Medicine staff' 'Security staff' 'Private service staff' 'Secretaries' 'Cleaning staff' 'Cooking staff' 'HR staff' 'Waiters/barmen staff' 'Realty agents' 'IT staff'] WEEKDAY_APPR_PROCESS_START: ['TUESDAY' 'FRIDAY' 'MONDAY' 'WEDNESDAY' 'THURSDAY' 'SATURDAY' 'SUNDAY'] ORGANIZATION_TYPE: ['Kindergarten' 'Self-employed' 'Transport: type 3' 'Business Entity Type 3' 'Government' 'Industry: type 9' 'School' 'Trade: type 2' 'XNA' 'Services' 'Bank' 'Industry: type 3' 'Other' 'Trade: type 6' 'Industry: type 12' 'Trade: type 7' 'Postal' 'Medicine' 'Housing' 'Business Entity Type 2' 'Construction' 'Military' 'Industry: type 4' 'Trade: type 3' 'Legal Services' 'Security' 'Industry: type 11' 'University' 'Business Entity Type 1' 'Agriculture' 'Security Ministries' 'Transport: type 2' 'Industry: type 7' 'Transport: type 4' 'Telecom' 'Emergency' 'Police' 'Industry: type 1' 'Transport: type 1' 'Electricity' 'Industry: type 5' 'Hotel' 'Restaurant' 'Advertising' 'Mobile' 'Trade: type 1' 'Industry: type 8' 'Realtor' 'Cleaning' 'Industry: type 2' 'Trade: type 4' 'Industry: type 6' 'Culture' 'Insurance' 'Religion' 'Industry: type 13' 'Industry: type 10' 'Trade: type 5'] FONDKAPREMONT_MODE: [nan 'reg oper account' 'not specified' 'org spec account' 'reg oper spec account'] HOUSETYPE_MODE: ['block of flats' nan 'specific housing' 'terraced house'] WALLSMATERIAL_MODE: ['Stone, brick' nan 'Panel' 'Block' 'Wooden' 'Mixed' 'Monolithic' 'Others'] EMERGENCYSTATE_MODE: ['No' nan 'Yes'] NAME_CONTRACT_TYPE: ['Cash loans' 'Revolving loans'] CODE_GENDER: ['M' 'F' 'XNA'] FLAG_OWN_CAR: ['N' 'Y'] FLAG_OWN_REALTY: ['Y' 'N'] NAME_TYPE_SUITE: ['Unaccompanied' 'Family' 'Spouse, partner' 'Children' 'Other_A' nan 'Other_B' 'Group of people'] NAME_INCOME_TYPE: ['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed' 'Student' 'Businessman' 'Maternity leave'] NAME_EDUCATION_TYPE: ['Secondary / secondary special' 'Higher education' 'Incomplete higher' 'Lower secondary' 'Academic degree'] NAME_FAMILY_STATUS: ['Single / not married' 'Married' 'Civil marriage' 'Widow' 'Separated' 'Unknown'] NAME_HOUSING_TYPE: ['House / apartment' 'Rented apartment' 'With parents' 'Municipal apartment' 'Office apartment' 'Co-op apartment'] OCCUPATION_TYPE: ['Laborers' 'Core staff' 'Accountants' 'Managers' nan 'Drivers' 'Sales staff' 'Cleaning staff' 'Cooking staff' 'Private service staff' 'Medicine staff' 'Security staff' 'High skill tech staff' 'Waiters/barmen staff' 'Low-skill Laborers' 'Realty agents' 'Secretaries' 'IT staff' 'HR staff'] WEEKDAY_APPR_PROCESS_START: ['WEDNESDAY' 'MONDAY' 'THURSDAY' 'SUNDAY' 'SATURDAY' 'FRIDAY' 'TUESDAY'] ORGANIZATION_TYPE: ['Business Entity Type 3' 'School' 'Government' 'Religion' 'Other' 'XNA' 'Electricity' 'Medicine' 'Business Entity Type 2' 'Self-employed' 'Transport: type 2' 'Construction' 'Housing' 'Kindergarten' 'Trade: type 7' 'Industry: type 11' 'Military' 'Services' 'Security Ministries' 'Transport: type 4' 'Industry: type 1' 'Emergency' 'Security' 'Trade: type 2' 'University' 'Transport: type 3' 'Police' 'Business Entity Type 1' 'Postal' 'Industry: type 4' 'Agriculture' 'Restaurant' 'Culture' 'Hotel' 'Industry: type 7' 'Trade: type 3' 'Industry: type 3' 'Bank' 'Industry: type 9' 'Insurance' 'Trade: type 6' 'Industry: type 2' 'Transport: type 1' 'Industry: type 12' 'Mobile' 'Trade: type 1' 'Industry: type 5' 'Industry: type 10' 'Legal Services' 'Advertising' 'Trade: type 5' 'Cleaning' 'Industry: type 13' 'Trade: type 4' 'Telecom' 'Industry: type 8' 'Realtor' 'Industry: type 6'] FONDKAPREMONT_MODE: ['reg oper account' nan 'org spec account' 'reg oper spec account' 'not specified'] HOUSETYPE_MODE: ['block of flats' nan 'terraced house' 'specific housing'] WALLSMATERIAL_MODE: ['Stone, brick' 'Block' nan 'Panel' 'Mixed' 'Wooden' 'Others' 'Monolithic'] EMERGENCYSTATE_MODE: ['No' nan 'Yes'] CREDIT_ACTIVE: ['Closed' 'Active' 'Sold' 'Bad debt'] CREDIT_CURRENCY: ['currency 1' 'currency 2' 'currency 4' 'currency 3'] CREDIT_TYPE: ['Consumer credit' 'Credit card' 'Mortgage' 'Car loan' 'Microloan' 'Loan for working capital replenishment' 'Loan for business development' 'Real estate loan' 'Unknown type of loan' 'Another type of loan' 'Cash loan (non-earmarked)' 'Loan for the purchase of equipment' 'Mobile operator loan' 'Interbank credit' 'Loan for purchase of shares (margin lending)'] STATUS: ['C' '0' 'X' '1' '2' '3' '5' '4'] NAME_CONTRACT_STATUS: ['Active' 'Completed' 'Demand' 'Signed' 'Sent proposal' 'Refused' 'Approved'] NAME_CONTRACT_STATUS: ['Active' 'Completed' 'Signed' 'Approved' 'Returned to the store' 'Demand' 'Canceled' 'XNA' 'Amortized debt'] NAME_CONTRACT_TYPE: ['Consumer loans' 'Cash loans' 'Revolving loans' 'XNA'] WEEKDAY_APPR_PROCESS_START: ['SATURDAY' 'THURSDAY' 'TUESDAY' 'MONDAY' 'FRIDAY' 'SUNDAY' 'WEDNESDAY'] FLAG_LAST_APPL_PER_CONTRACT: ['Y' 'N'] NAME_CASH_LOAN_PURPOSE: ['XAP' 'XNA' 'Repairs' 'Everyday expenses' 'Car repairs' 'Building a house or an annex' 'Other' 'Journey' 'Purchase of electronic equipment' 'Medicine' 'Payments on other loans' 'Urgent needs' 'Buying a used car' 'Buying a new car' 'Buying a holiday home / land' 'Education' 'Buying a home' 'Furniture' 'Buying a garage' 'Business development' 'Wedding / gift / holiday' 'Hobby' 'Gasification / water supply' 'Refusal to name the goal' 'Money for a third person'] NAME_CONTRACT_STATUS: ['Approved' 'Refused' 'Canceled' 'Unused offer'] NAME_PAYMENT_TYPE: ['Cash through the bank' 'XNA' 'Non-cash from your account' 'Cashless from the account of the employer'] CODE_REJECT_REASON: ['XAP' 'HC' 'LIMIT' 'CLIENT' 'SCOFR' 'SCO' 'XNA' 'VERIF' 'SYSTEM'] NAME_TYPE_SUITE: [nan 'Unaccompanied' 'Spouse, partner' 'Family' 'Children' 'Other_B' 'Other_A' 'Group of people'] NAME_CLIENT_TYPE: ['Repeater' 'New' 'Refreshed' 'XNA'] NAME_GOODS_CATEGORY: ['Mobile' 'XNA' 'Consumer Electronics' 'Construction Materials' 'Auto Accessories' 'Photo / Cinema Equipment' 'Computers' 'Audio/Video' 'Medicine' 'Clothing and Accessories' 'Furniture' 'Sport and Leisure' 'Homewares' 'Gardening' 'Jewelry' 'Vehicles' 'Education' 'Medical Supplies' 'Other' 'Direct Sales' 'Office Appliances' 'Fitness' 'Tourism' 'Insurance' 'Additional Service' 'Weapon' 'Animals' 'House Construction'] NAME_PORTFOLIO: ['POS' 'Cash' 'XNA' 'Cards' 'Cars'] NAME_PRODUCT_TYPE: ['XNA' 'x-sell' 'walk-in'] CHANNEL_TYPE: ['Country-wide' 'Contact center' 'Credit and cash offices' 'Stone' 'Regional / Local' 'AP+ (Cash loan)' 'Channel of corporate sales' 'Car dealer'] NAME_SELLER_INDUSTRY: ['Connectivity' 'XNA' 'Consumer electronics' 'Industry' 'Clothing' 'Furniture' 'Construction' 'Jewelry' 'Auto technology' 'MLM partners' 'Tourism'] NAME_YIELD_GROUP: ['middle' 'low_action' 'high' 'low_normal' 'XNA'] PRODUCT_COMBINATION: ['POS mobile with interest' 'Cash X-Sell: low' 'Cash X-Sell: high' 'Cash X-Sell: middle' 'Cash Street: high' 'Cash' 'POS household without interest' 'POS household with interest' 'POS other with interest' 'Card X-Sell' 'POS mobile without interest' 'Card Street' 'POS industry with interest' 'Cash Street: low' 'POS industry without interest' 'Cash Street: middle' 'POS others without interest' nan]
#Shape of the train and test set
#training set
num_rows_train = applicationtrainDF.shape
print(num_rows_train)
#test set
num_rows_test = applicationtestDF.shape
print(num_rows_test)
(307511, 122) (48744, 121)
There are 50 variables in both data sets where 40% or more of their values are null. Nearly all of those variables involve a applicant's information about their living accomodations.
#Missing count and percentage of each column in training set
#Check for null and provide a count of the total
missing_train = applicationtrainDF.isnull().sum()
#Filter for any columns that have greater than 0 null values
missing_train = missing_train[missing_train > 0]
#Sort the column in descending order based on count
missing_train = missing_train.sort_values(ascending=False)
#Provide a percentage of missing values in each column
percent_missing_train = (missing_train / len(applicationtrainDF)) * 100
#Create a data frame that includes the missing count and percentages
missing_data_train = pd.DataFrame({'Missing Count': missing_train, 'Percentage': percent_missing_train})
#Display all the columns based on the above criteria
with pd.option_context('display.max_rows', None):
print(missing_data_train)
Missing Count Percentage COMMONAREA_MEDI 214865 69.872297 COMMONAREA_AVG 214865 69.872297 COMMONAREA_MODE 214865 69.872297 NONLIVINGAPARTMENTS_MEDI 213514 69.432963 NONLIVINGAPARTMENTS_MODE 213514 69.432963 NONLIVINGAPARTMENTS_AVG 213514 69.432963 FONDKAPREMONT_MODE 210295 68.386172 LIVINGAPARTMENTS_MODE 210199 68.354953 LIVINGAPARTMENTS_MEDI 210199 68.354953 LIVINGAPARTMENTS_AVG 210199 68.354953 FLOORSMIN_MODE 208642 67.848630 FLOORSMIN_MEDI 208642 67.848630 FLOORSMIN_AVG 208642 67.848630 YEARS_BUILD_MODE 204488 66.497784 YEARS_BUILD_MEDI 204488 66.497784 YEARS_BUILD_AVG 204488 66.497784 OWN_CAR_AGE 202929 65.990810 LANDAREA_AVG 182590 59.376738 LANDAREA_MEDI 182590 59.376738 LANDAREA_MODE 182590 59.376738 BASEMENTAREA_MEDI 179943 58.515956 BASEMENTAREA_AVG 179943 58.515956 BASEMENTAREA_MODE 179943 58.515956 EXT_SOURCE_1 173378 56.381073 NONLIVINGAREA_MEDI 169682 55.179164 NONLIVINGAREA_MODE 169682 55.179164 NONLIVINGAREA_AVG 169682 55.179164 ELEVATORS_MEDI 163891 53.295980 ELEVATORS_MODE 163891 53.295980 ELEVATORS_AVG 163891 53.295980 WALLSMATERIAL_MODE 156341 50.840783 APARTMENTS_MODE 156061 50.749729 APARTMENTS_MEDI 156061 50.749729 APARTMENTS_AVG 156061 50.749729 ENTRANCES_MODE 154828 50.348768 ENTRANCES_AVG 154828 50.348768 ENTRANCES_MEDI 154828 50.348768 LIVINGAREA_MEDI 154350 50.193326 LIVINGAREA_MODE 154350 50.193326 LIVINGAREA_AVG 154350 50.193326 HOUSETYPE_MODE 154297 50.176091 FLOORSMAX_MEDI 153020 49.760822 FLOORSMAX_AVG 153020 49.760822 FLOORSMAX_MODE 153020 49.760822 YEARS_BEGINEXPLUATATION_AVG 150007 48.781019 YEARS_BEGINEXPLUATATION_MEDI 150007 48.781019 YEARS_BEGINEXPLUATATION_MODE 150007 48.781019 TOTALAREA_MODE 148431 48.268517 EMERGENCYSTATE_MODE 145755 47.398304 OCCUPATION_TYPE 96391 31.345545 EXT_SOURCE_3 60965 19.825307 AMT_REQ_CREDIT_BUREAU_WEEK 41519 13.501631 AMT_REQ_CREDIT_BUREAU_DAY 41519 13.501631 AMT_REQ_CREDIT_BUREAU_MON 41519 13.501631 AMT_REQ_CREDIT_BUREAU_QRT 41519 13.501631 AMT_REQ_CREDIT_BUREAU_HOUR 41519 13.501631 AMT_REQ_CREDIT_BUREAU_YEAR 41519 13.501631 NAME_TYPE_SUITE 1292 0.420148 DEF_30_CNT_SOCIAL_CIRCLE 1021 0.332021 OBS_60_CNT_SOCIAL_CIRCLE 1021 0.332021 DEF_60_CNT_SOCIAL_CIRCLE 1021 0.332021 OBS_30_CNT_SOCIAL_CIRCLE 1021 0.332021 EXT_SOURCE_2 660 0.214626 AMT_GOODS_PRICE 278 0.090403 AMT_ANNUITY 12 0.003902 CNT_FAM_MEMBERS 2 0.000650 DAYS_LAST_PHONE_CHANGE 1 0.000325
#See how many columns in applicationtestDF have null values
#Check for null and provide a count of the total
missing_test = applicationtestDF.isnull().sum()
#Filter for any columns that have greater than 0 null values
missing_test = missing_test[missing_test > 0]
#Sort the column in descending order based on count
missing_test = missing_test.sort_values(ascending=False)
#Provide a percentage of missing values in each column
percent_missing_test = (missing_test / len(applicationtestDF)) * 100
#Create a data frame that includes the missing count and percentages
missing_data_test = pd.DataFrame({'Missing Count': missing_test, 'Percentage': percent_missing_test})
#Display all the columns based on the above criteria
with pd.option_context('display.max_rows', None):
print(missing_data_test)
Missing Count Percentage COMMONAREA_MODE 33495 68.716150 COMMONAREA_MEDI 33495 68.716150 COMMONAREA_AVG 33495 68.716150 NONLIVINGAPARTMENTS_MEDI 33347 68.412523 NONLIVINGAPARTMENTS_AVG 33347 68.412523 NONLIVINGAPARTMENTS_MODE 33347 68.412523 FONDKAPREMONT_MODE 32797 67.284179 LIVINGAPARTMENTS_MODE 32780 67.249302 LIVINGAPARTMENTS_MEDI 32780 67.249302 LIVINGAPARTMENTS_AVG 32780 67.249302 FLOORSMIN_MEDI 32466 66.605121 FLOORSMIN_MODE 32466 66.605121 FLOORSMIN_AVG 32466 66.605121 OWN_CAR_AGE 32312 66.289184 YEARS_BUILD_AVG 31818 65.275726 YEARS_BUILD_MEDI 31818 65.275726 YEARS_BUILD_MODE 31818 65.275726 LANDAREA_MODE 28254 57.964057 LANDAREA_AVG 28254 57.964057 LANDAREA_MEDI 28254 57.964057 BASEMENTAREA_MEDI 27641 56.706466 BASEMENTAREA_AVG 27641 56.706466 BASEMENTAREA_MODE 27641 56.706466 NONLIVINGAREA_MEDI 26084 53.512227 NONLIVINGAREA_MODE 26084 53.512227 NONLIVINGAREA_AVG 26084 53.512227 ELEVATORS_MEDI 25189 51.676104 ELEVATORS_MODE 25189 51.676104 ELEVATORS_AVG 25189 51.676104 WALLSMATERIAL_MODE 23893 49.017315 APARTMENTS_MODE 23887 49.005006 APARTMENTS_MEDI 23887 49.005006 APARTMENTS_AVG 23887 49.005006 HOUSETYPE_MODE 23619 48.455194 ENTRANCES_MODE 23579 48.373133 ENTRANCES_MEDI 23579 48.373133 ENTRANCES_AVG 23579 48.373133 LIVINGAREA_AVG 23552 48.317742 LIVINGAREA_MEDI 23552 48.317742 LIVINGAREA_MODE 23552 48.317742 FLOORSMAX_MEDI 23321 47.843837 FLOORSMAX_MODE 23321 47.843837 FLOORSMAX_AVG 23321 47.843837 YEARS_BEGINEXPLUATATION_MEDI 22856 46.889874 YEARS_BEGINEXPLUATATION_MODE 22856 46.889874 YEARS_BEGINEXPLUATATION_AVG 22856 46.889874 TOTALAREA_MODE 22624 46.413918 EMERGENCYSTATE_MODE 22209 45.562531 EXT_SOURCE_1 20532 42.122107 OCCUPATION_TYPE 15605 32.014197 EXT_SOURCE_3 8668 17.782701 AMT_REQ_CREDIT_BUREAU_MON 6049 12.409732 AMT_REQ_CREDIT_BUREAU_HOUR 6049 12.409732 AMT_REQ_CREDIT_BUREAU_QRT 6049 12.409732 AMT_REQ_CREDIT_BUREAU_WEEK 6049 12.409732 AMT_REQ_CREDIT_BUREAU_DAY 6049 12.409732 AMT_REQ_CREDIT_BUREAU_YEAR 6049 12.409732 NAME_TYPE_SUITE 911 1.868948 DEF_60_CNT_SOCIAL_CIRCLE 29 0.059495 OBS_60_CNT_SOCIAL_CIRCLE 29 0.059495 DEF_30_CNT_SOCIAL_CIRCLE 29 0.059495 OBS_30_CNT_SOCIAL_CIRCLE 29 0.059495 AMT_ANNUITY 24 0.049237 EXT_SOURCE_2 8 0.016412
The variables that are missing more than 40% of their values have been normalized. Their values range from 0-1.
#Summary Metrics of the training set
# Filter columns with over 40% null values
missingdata_columns = missing_data_train[missing_data_train['Percentage'] > 40].index
# Calculate summary metrics for filtered columns
summary_metrics = applicationtrainDF[missingdata_columns].describe(include=np.number)
#Display summary metrics
print(summary_metrics)
COMMONAREA_MEDI COMMONAREA_AVG COMMONAREA_MODE \
count 92646.000000 92646.000000 92646.000000
mean 0.044595 0.044621 0.042553
std 0.076144 0.076036 0.074445
min 0.000000 0.000000 0.000000
25% 0.007900 0.007800 0.007200
50% 0.020800 0.021100 0.019000
75% 0.051300 0.051500 0.049000
max 1.000000 1.000000 1.000000
NONLIVINGAPARTMENTS_MEDI NONLIVINGAPARTMENTS_MODE \
count 93997.000000 93997.000000
mean 0.008651 0.008076
std 0.047415 0.046276
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.003900 0.003900
max 1.000000 1.000000
NONLIVINGAPARTMENTS_AVG LIVINGAPARTMENTS_MODE LIVINGAPARTMENTS_MEDI \
count 93997.000000 97312.000000 97312.000000
mean 0.008809 0.105645 0.101954
std 0.047732 0.097880 0.093642
min 0.000000 0.000000 0.000000
25% 0.000000 0.054200 0.051300
50% 0.000000 0.077100 0.076100
75% 0.003900 0.131300 0.123100
max 1.000000 1.000000 1.000000
LIVINGAPARTMENTS_AVG FLOORSMIN_MODE ... LIVINGAREA_MEDI \
count 97312.000000 98869.000000 ... 153161.000000
mean 0.100775 0.228058 ... 0.108607
std 0.092576 0.161160 ... 0.112260
min 0.000000 0.000000 ... 0.000000
25% 0.050400 0.083300 ... 0.045700
50% 0.075600 0.208300 ... 0.074900
75% 0.121000 0.375000 ... 0.130300
max 1.000000 1.000000 ... 1.000000
LIVINGAREA_MODE LIVINGAREA_AVG FLOORSMAX_MEDI FLOORSMAX_AVG \
count 153161.000000 153161.000000 154491.000000 154491.000000
mean 0.105975 0.107399 0.225897 0.226282
std 0.111845 0.110565 0.145067 0.144641
min 0.000000 0.000000 0.000000 0.000000
25% 0.042700 0.045300 0.166700 0.166700
50% 0.073100 0.074500 0.166700 0.166700
75% 0.125200 0.129900 0.333300 0.333300
max 1.000000 1.000000 1.000000 1.000000
FLOORSMAX_MODE YEARS_BEGINEXPLUATATION_AVG \
count 154491.000000 157504.000000
mean 0.222315 0.977735
std 0.143709 0.059223
min 0.000000 0.000000
25% 0.166700 0.976700
50% 0.166700 0.981600
75% 0.333300 0.986600
max 1.000000 1.000000
YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE \
count 157504.000000 157504.000000
mean 0.977752 0.977065
std 0.059897 0.064575
min 0.000000 0.000000
25% 0.976700 0.976700
50% 0.981600 0.981600
75% 0.986600 0.986600
max 1.000000 1.000000
TOTALAREA_MODE
count 159080.000000
mean 0.102547
std 0.107462
min 0.000000
25% 0.041200
50% 0.068800
75% 0.127600
max 1.000000
[8 rows x 45 columns]
#View 2 of the columns' data from the summary metrics info above
#Identify the two columns
two_columns = ["COMMONAREA_MEDI", "NONLIVINGAPARTMENTS_AVG"]
#create a dataframe of those columns where the rows do not contain null values
two_columns_df = applicationtrainDF[applicationtrainDF[two_columns].notnull().all(axis=1)]
#display the dataframe of the two columns
print(two_columns_df[two_columns])
COMMONAREA_MEDI NONLIVINGAPARTMENTS_AVG 0 0.0144 0.0000 1 0.0608 0.0039 13 0.0585 0.0000 14 0.1150 0.0193 18 0.0018 0.0000 ... ... ... 307495 0.0137 0.0039 307505 0.1441 0.0154 307506 0.0203 0.0753 307507 0.0022 0.0000 307508 0.0124 0.0000 [88507 rows x 2 columns]
As we progress in our model building process, we need to analyze the greater than 40% null value variables and how they interact with the model using a subsample of the training dataset. The sample should contain the rows of these variables that are not null to see if they are statiscally significant to the model. If they are stasticially significant, we can then choose to either impute the data or build our model off only the subsample of the dataset. If we were to use only 30% of the training dataset, we'd still have a sample size of around 92,000.
#Summary Metrics of the training set
# Filter columns with UNDER 40% null values
missingdata_columns2 = missing_data_train[missing_data_train['Percentage'] < 40].index
# Calculate summary metrics for filtered columns
summary_metrics2 = applicationtrainDF[missingdata_columns2].describe(include=np.number)
#Display summary metrics
print(summary_metrics2)
EXT_SOURCE_3 AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_DAY \
count 246546.000000 265992.000000 265992.000000
mean 0.510853 0.034362 0.007000
std 0.194844 0.204685 0.110757
min 0.000527 0.000000 0.000000
25% 0.370650 0.000000 0.000000
50% 0.535276 0.000000 0.000000
75% 0.669057 0.000000 0.000000
max 0.896010 8.000000 9.000000
AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT \
count 265992.000000 265992.000000
mean 0.267395 0.265474
std 0.916002 0.794056
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 27.000000 261.000000
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_YEAR \
count 265992.000000 265992.000000
mean 0.006402 1.899974
std 0.083849 1.869295
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 1.000000
75% 0.000000 3.000000
max 4.000000 25.000000
DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE \
count 306490.000000 306490.000000
mean 0.143421 1.405292
std 0.446698 2.379803
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 2.000000
max 34.000000 344.000000
DEF_60_CNT_SOCIAL_CIRCLE OBS_30_CNT_SOCIAL_CIRCLE EXT_SOURCE_2 \
count 306490.000000 306490.000000 3.068510e+05
mean 0.100049 1.422245 5.143927e-01
std 0.362291 2.400989 1.910602e-01
min 0.000000 0.000000 8.173617e-08
25% 0.000000 0.000000 3.924574e-01
50% 0.000000 0.000000 5.659614e-01
75% 0.000000 2.000000 6.636171e-01
max 24.000000 348.000000 8.549997e-01
AMT_GOODS_PRICE AMT_ANNUITY CNT_FAM_MEMBERS DAYS_LAST_PHONE_CHANGE
count 3.072330e+05 307499.000000 307509.000000 307510.000000
mean 5.383962e+05 27108.573909 2.152665 -962.858788
std 3.694465e+05 14493.737315 0.910682 826.808487
min 4.050000e+04 1615.500000 1.000000 -4292.000000
25% 2.385000e+05 16524.000000 2.000000 -1570.000000
50% 4.500000e+05 24903.000000 2.000000 -757.000000
75% 6.795000e+05 34596.000000 3.000000 -274.000000
max 4.050000e+06 258025.500000 20.000000 0.000000
When examining the variables that contain less than 40% null values, we can see that we need to take care of some possible outliers in the data. The max values listed on the summary stastics for some of the variables is well above 2 standard deviations from the mean.
unique_value = (applicationtrainDF['TARGET']).unique()
print('unique_value:', unique_value)
target_total_count = len(applicationtrainDF['TARGET'])
print('target_total_count:',target_total_count)
target_0_count = len(applicationtrainDF[applicationtrainDF['TARGET'] == 0])
print('target_0_count:',target_0_count)
target_1_count = len(applicationtrainDF[applicationtrainDF['TARGET'] == 1])
print('target_1_count:',target_1_count)
# Calculate the class distribution or Percentage of occurance
target_counts = applicationtrainDF['TARGET'].value_counts()
class_distribution = target_counts / target_counts.sum() * 100
print("\n", class_distribution)
unique_value: [1 0] target_total_count: 307511 target_0_count: 282686 target_1_count: 24825 TARGET 0 91.927118 1 8.072882 Name: count, dtype: float64
# Installing necessary packages
# !pip install matplotlib
# !pip install plotly
# !pip install seaborn
Collecting seaborn Using cached seaborn-0.12.2-py3-none-any.whl (293 kB) Requirement already satisfied: numpy!=1.24.0,>=1.17 in c:\users\jusha\new folder\lib\site-packages (from seaborn) (1.24.3) Requirement already satisfied: pandas>=0.25 in c:\users\jusha\new folder\lib\site-packages (from seaborn) (2.0.2) Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in c:\users\jusha\new folder\lib\site-packages (from seaborn) (3.7.1) Requirement already satisfied: pillow>=6.2.0 in c:\users\jusha\new folder\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0) Requirement already satisfied: python-dateutil>=2.7 in c:\users\jusha\new folder\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2) Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\jusha\new folder\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4) Requirement already satisfied: cycler>=0.10 in c:\users\jusha\new folder\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0) Requirement already satisfied: packaging>=20.0 in c:\users\jusha\new folder\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (23.0) Requirement already satisfied: contourpy>=1.0.1 in c:\users\jusha\new folder\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.1.0) Requirement already satisfied: fonttools>=4.22.0 in c:\users\jusha\new folder\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.40.0) Requirement already satisfied: pyparsing>=2.3.1 in c:\users\jusha\new folder\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9) Requirement already satisfied: pytz>=2020.1 in c:\users\jusha\new folder\lib\site-packages (from pandas>=0.25->seaborn) (2022.7) Requirement already satisfied: tzdata>=2022.1 in c:\users\jusha\new folder\lib\site-packages (from pandas>=0.25->seaborn) (2023.3) Requirement already satisfied: six>=1.5 in c:\users\jusha\new folder\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0) Installing collected packages: seaborn Successfully installed seaborn-0.12.2
import matplotlib.pyplot as plt
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
# Gender Distribution
plt.figure(figsize=(8, 6))
gender_counts = two_columns_df['CODE_GENDER'].value_counts()
plt.bar(gender_counts.index, gender_counts.values)
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Gender Distribution')
for i, v in enumerate(gender_counts.values):
plt.text(i, v, str(v), ha='center', va='bottom', fontweight='bold')
plt.show()
gender_summary = gender_counts.reset_index().rename(columns={'index': 'Gender', 'Gender': 'Count'})
#Education of the applicants
temp = two_columns_df["NAME_EDUCATION_TYPE"].value_counts()
df = pd.DataFrame({'labels': temp.index,
'values': temp.values})
fig = px.pie(df, names='labels', values='values', title='Education of Applicants', hole=0.5)
fig.show()
#Income_type of the applicants
temp = two_columns_df["NAME_INCOME_TYPE"].value_counts()
df = pd.DataFrame({'labels': temp.index,
'values': temp.values})
fig = px.pie(df, names='labels', values='values', title='Applicants Income Type', hole=0.5)
fig.show()
# Plot the distribution of ages in years
plt.figure(figsize=(8, 6))
plt.hist(np.abs(two_columns_df['DAYS_BIRTH'] / 365), edgecolor='k', bins=25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
plt.show()
# Who accompanied the client distrinution
temp = two_columns_df["NAME_TYPE_SUITE"].value_counts()
trace = go.Bar(
x=temp.index,
y=(temp / temp.sum()) * 100,
text=temp.values,
textposition='auto'
)
data = [trace]
layout = go.Layout(
title="Who accompanied client when applying for the application in %",
xaxis=dict(
title='Name of type of the Suite',
#tickangle=45,
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'
)
),
yaxis=dict(
title='Count of Name of type of the Suite in %',
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'
),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'
)
)
)
fig = go.Figure(data=data, layout=layout)
fig.show()
# Organization Type Distribution
temp = two_columns_df["ORGANIZATION_TYPE"].value_counts()
trace = go.Bar(
x=temp.index,
y=(temp / temp.sum()) * 100,
marker=dict(
color=(temp / temp.sum()) * 100,
colorscale='Greens',
reversescale=True
),
)
data = [trace]
layout = go.Layout(
title="Top industry of the seller in %",
xaxis=dict(
title='Industry Name',
tickangle=45,
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'
)
),
yaxis=dict(
title='Count in %',
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'
),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'
)
)
)
fig = go.Figure(data=data, layout=layout)
fig.show()
# Housing type distribution
housing_type_counts = two_columns_df['NAME_HOUSING_TYPE'].value_counts()
ax = housing_type_counts.plot(kind='bar')
plt.title('Count of Housing Types')
plt.xlabel('Housing Type')
plt.ylabel('Count')
for p in ax.patches:
ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2, p.get_height()), ha='center', va='bottom')
plt.show()
# Count plot for client occupation type
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=applicationtrainDF, x='OCCUPATION_TYPE')
plt.xlabel('Client Occupation Type')
plt.ylabel('Count of Loan Applicants')
plt.title('Count of Loan Applicants by Client Occupation Type')
plt.xticks(rotation=45)
for p in ax.patches:
ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
plt.show()
DATA SUMMARY: This project constitutes 6 datasets in csv format namely, application_{train|test}.csv which is the main dataset, which has an active relation to the datasets previous_application.csv and bureau.csv with the SK_ID_CURR column. This column in the main dataset is unique and acts as an identifier for all loan applications. Bereau.csv dataset is further mapped one-to-one to bureau_balance.csv dataset with the SK_ID_BUREAU column and previous_application.csv is further mapped to 3 other datasets POS_CASH_balance.csv, installments_payments.csv and credit_card_balance.csv with the column SK_ID_Prev and both these are active relations.
The target variable in Home Credit default risk project is TARGET column in application_train.csv dataset containing binary values, 1 for clients with payment difficulties and 0 for all other situations. 91% of the total records have 0 while remaining 9% has 1 as TARGET value. Based on this column, the project attempts to identify which all clients are at higher risk of credit default.
Within the analysis the steps taken are to find the shapes of the dataframes, examining the first 5 rows each dataframe using the head() function, discovering common columns that each dataframe with each other, print the info for each of the dataframe variable objects, get summary statistics for each of the variables in teh dataframes, gather information about the categorical variables used in the dataframes, and printing lists of each of the unique value in the categorical variables of the data frames.
Both the application.train dataset and application.test dataset contain missing values. Deciding whether to remove the rows that contain missing values, excluding certain variables, or imputing the missing values can potentionally have a high impact on the model's ability to predict the target variable accurately. Considering columns to imputed which are having missing value percentage above 40% is a good approach for this project.
The projects' visualization displays most applicants are female and stand at nearly 60000 compared to male at 29000 and most clients are middle-aged as per the age distribution. Most applicants are living in apartments and most applicants are of occupation 'laborers'.
Each of the dataframes has an interesting way of interacting with one another that we can use to determine our target variable of whether the borrower is trustworthy or not within our training set.
After EDA, the next of the project will be to perform feature engineering, obtaining correlation between independent variables and modelling predictive algorithm using strongest predictors and target variable.
Justin worked on the exploration and description of the dataset. Justin first started by creating tables that had the data descriptions from Kaggle. Justin then created each of the data frames by reading in the CSVs. Then using the data frames created, he viewed their shape, then viewing each of the top 5 rows of the data frames. He then used a function to view all of the common columns that appeared multiple times across the data frames. He then examined the categorical vs numeric data that was found within each data frame by looking at the summary stats of the numeric data and then looking at the proportions and unique values of the categorical data to see a vector of all possible classifications for categorical variables.
Rachel Butterfield explored the scope of missing data in the training and test datasets. She examined all the variables in both sets by looking at the total count and percentage of missing values, examining a handful of the variables to see what the values were, identified possible outliers in the data. She also provided some strategic suggestions as to how to manage the missing data as we continue building out model.
Based on the data provided, I developed visualizations such as plots and summary tables. I discovered correlations between variables by employing statistical approaches and visualizations. I made certain that our visualizations were clear and appropriately labeled. I also gave brief descriptions of the plots and tables to emphasize key findings. Overall, my contributions during the EDA phase included developing visualizations, evaluating data, and communicating effectively with the team.
Debayan Dutta worked on exploring the data distribution and distribution of target varible in the main dataset. Developed the distribution of Occupation type bar graph and added the doughnut chart visualizing applicants income_type. Debayan also developed the summary and inferences of the EDA step and also devised a roadmap for future steps we can take for the project in the modelling phase.